Project

1. Title:

Hotel Reviews Sentiment Analysis

2. Brief on the project:

The data was scraped from Booking.com. It contains reviews of hotels in multiple geographical locations.

3. About the dataset:

This dataset contains 515,000 customer reviews and scores for 1493 luxury hotels across Europe. The geographical location of each hotel is also provided for further analysis. The CSV file contains 17 fields, described below:

Hotel_Address: Address of hotel.

Review_Date: Date when reviewer posted the corresponding review.

Average_Score: Average Score of the hotel, calculated based on the latest comment in the last year.

Hotel_Name: Name of Hotel

Reviewer_Nationality: Nationality of Reviewer

Negative_Review: Negative review the reviewer gave to the hotel. If the reviewer gave no negative review, the value is 'No Negative'.

Review_Total_Negative_Word_Counts: Total number of words in the negative review.

Positive_Review: Positive review the reviewer gave to the hotel. If the reviewer gave no positive review, the value is 'No Positive'.

Review_Total_Positive_Word_Counts: Total number of words in the positive review.

Reviewer_Score: Score the reviewer has given to the hotel, based on his/her experience

Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.

Total_Number_of_Reviews: Total number of valid reviews the hotel has.

Tags: Tags reviewer gave the hotel.

days_since_review: Duration between the review date and scrape date.

Additional_Number_of_Scoring: Some guests gave only a score for the service without writing a review. This number indicates how many such valid scores without a review the hotel has.

lat: Latitude of the hotel

lng: Longitude of the hotel
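
A quick way to confirm that a downloaded copy matches this description is to compare the CSV header against the field names above. The sketch below uses only the standard library; the helper name `missing_fields` and the in-memory sample header are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

# The 17 field names listed in the dataset description.
EXPECTED_FIELDS = [
    "Hotel_Address", "Additional_Number_of_Scoring", "Review_Date",
    "Average_Score", "Hotel_Name", "Reviewer_Nationality",
    "Negative_Review", "Review_Total_Negative_Word_Counts",
    "Total_Number_of_Reviews", "Positive_Review",
    "Review_Total_Positive_Word_Counts",
    "Total_Number_of_Reviews_Reviewer_Has_Given", "Reviewer_Score",
    "Tags", "days_since_review", "lat", "lng",
]

def missing_fields(csv_file):
    """Return the expected field names absent from the CSV header row."""
    header = next(csv.reader(csv_file))
    return [name for name in EXPECTED_FIELDS if name not in header]

# Hypothetical header with the last two columns missing, for illustration.
sample = io.StringIO(",".join(EXPECTED_FIELDS[:-2]) + "\n")
print(missing_fields(sample))
```

With the real file, the same helper could be called on an open file handle, for example `missing_fields(open('Hotel_Reviews.csv', newline='', encoding='utf-8'))`; the encoding is an assumption.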

4. Objective:

To perform sentiment analysis on the hotel reviews.

5. Individual Details:

Name: Vidit Kumar Pal, Email: vidit.20.pal@gmail.com, Contact: +91-7985431988

Importing Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import emoji
import string
import nltk
from PIL import Image
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC,LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
import pickle
In [2]:
data=pd.read_csv('Hotel_Reviews.csv')
In [3]:
data.head()
Out[3]:
Hotel_Address Additional_Number_of_Scoring Review_Date Average_Score Hotel_Name Reviewer_Nationality Negative_Review Review_Total_Negative_Word_Counts Total_Number_of_Reviews Positive_Review Review_Total_Positive_Word_Counts Total_Number_of_Reviews_Reviewer_Has_Given Reviewer_Score Tags days_since_review lat lng
0 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 8/3/2017 7.7 Hotel Arena Russia I am so angry that i made this post available... 397 1403 Only the park outside of the hotel was beauti... 11 7 2.9 [' Leisure trip ', ' Couple ', ' Duplex Double... 0 days 52.360576 4.915968
1 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 8/3/2017 7.7 Hotel Arena Ireland No Negative 0 1403 No real complaints the hotel was great great ... 105 7 7.5 [' Leisure trip ', ' Couple ', ' Duplex Double... 0 days 52.360576 4.915968
2 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 7/31/2017 7.7 Hotel Arena Australia Rooms are nice but for elderly a bit difficul... 42 1403 Location was good and staff were ok It is cut... 21 9 7.1 [' Leisure trip ', ' Family with young childre... 3 days 52.360576 4.915968
3 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 7/31/2017 7.7 Hotel Arena United Kingdom My room was dirty and I was afraid to walk ba... 210 1403 Great location in nice surroundings the bar a... 26 1 3.8 [' Leisure trip ', ' Solo traveler ', ' Duplex... 3 days 52.360576 4.915968
4 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 7/24/2017 7.7 Hotel Arena New Zealand You When I booked with your company on line y... 140 1403 Amazing location and building Romantic setting 8 3 6.7 [' Leisure trip ', ' Couple ', ' Suite ', ' St... 10 days 52.360576 4.915968
In [4]:
data.tail()
Out[4]:
Hotel_Address Additional_Number_of_Scoring Review_Date Average_Score Hotel_Name Reviewer_Nationality Negative_Review Review_Total_Negative_Word_Counts Total_Number_of_Reviews Positive_Review Review_Total_Positive_Word_Counts Total_Number_of_Reviews_Reviewer_Has_Given Reviewer_Score Tags days_since_review lat lng
515733 Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... 168 8/30/2015 8.1 Atlantis Hotel Vienna Kuwait no trolly or staff to help you take the lugga... 14 2823 location 2 8 7.0 [' Leisure trip ', ' Family with older childre... 704 day 48.203745 16.335677
515734 Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... 168 8/22/2015 8.1 Atlantis Hotel Vienna Estonia The hotel looks like 3 but surely not 4 11 2823 Breakfast was ok and we got earlier check in 11 12 5.8 [' Leisure trip ', ' Family with young childre... 712 day 48.203745 16.335677
515735 Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... 168 8/19/2015 8.1 Atlantis Hotel Vienna Egypt The ac was useless It was a hot week in vienn... 19 2823 No Positive 0 3 2.5 [' Leisure trip ', ' Family with older childre... 715 day 48.203745 16.335677
515736 Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... 168 8/17/2015 8.1 Atlantis Hotel Vienna Mexico No Negative 0 2823 The rooms are enormous and really comfortable... 25 3 8.8 [' Leisure trip ', ' Group ', ' Standard Tripl... 717 day 48.203745 16.335677
515737 Wurzbachgasse 21 15 Rudolfsheim F nfhaus 1150 ... 168 8/9/2015 8.1 Atlantis Hotel Vienna Hungary I was in 3rd floor It didn t work Free Wife 13 2823 staff was very kind 6 1 8.3 [' Leisure trip ', ' Family with young childre... 725 day 48.203745 16.335677
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   Hotel_Address                               515738 non-null  object 
 1   Additional_Number_of_Scoring                515738 non-null  int64  
 2   Review_Date                                 515738 non-null  object 
 3   Average_Score                               515738 non-null  float64
 4   Hotel_Name                                  515738 non-null  object 
 5   Reviewer_Nationality                        515738 non-null  object 
 6   Negative_Review                             515738 non-null  object 
 7   Review_Total_Negative_Word_Counts           515738 non-null  int64  
 8   Total_Number_of_Reviews                     515738 non-null  int64  
 9   Positive_Review                             515738 non-null  object 
 10  Review_Total_Positive_Word_Counts           515738 non-null  int64  
 11  Total_Number_of_Reviews_Reviewer_Has_Given  515738 non-null  int64  
 12  Reviewer_Score                              515738 non-null  float64
 13  Tags                                        515738 non-null  object 
 14  days_since_review                           515738 non-null  object 
 15  lat                                         512470 non-null  float64
 16  lng                                         512470 non-null  float64
dtypes: float64(4), int64(5), object(8)
memory usage: 66.9+ MB
In [6]:
data.isnull().sum()
Out[6]:
Hotel_Address                                    0
Additional_Number_of_Scoring                     0
Review_Date                                      0
Average_Score                                    0
Hotel_Name                                       0
Reviewer_Nationality                             0
Negative_Review                                  0
Review_Total_Negative_Word_Counts                0
Total_Number_of_Reviews                          0
Positive_Review                                  0
Review_Total_Positive_Word_Counts                0
Total_Number_of_Reviews_Reviewer_Has_Given       0
Reviewer_Score                                   0
Tags                                             0
days_since_review                                0
lat                                           3268
lng                                           3268
dtype: int64
In [7]:
data.dropna(inplace=True,axis=0)
In [8]:
data.isnull().sum()
Out[8]:
Hotel_Address                                 0
Additional_Number_of_Scoring                  0
Review_Date                                   0
Average_Score                                 0
Hotel_Name                                    0
Reviewer_Nationality                          0
Negative_Review                               0
Review_Total_Negative_Word_Counts             0
Total_Number_of_Reviews                       0
Positive_Review                               0
Review_Total_Positive_Word_Counts             0
Total_Number_of_Reviews_Reviewer_Has_Given    0
Reviewer_Score                                0
Tags                                          0
days_since_review                             0
lat                                           0
lng                                           0
dtype: int64
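
Only `lat` and `lng` contain missing values here, so `dropna()` removes exactly the rows without coordinates. Naming the columns explicitly makes that dependency visible and guards against silently dropping rows if other columns gain missing values later. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Negative_Review": ["No Negative", " Noisy room ", " Nothing "],
    "lat": [52.36, None, 48.20],
    "lng": [4.92, None, 16.34],
})

# Same result as a plain dropna() when only lat/lng have NaNs,
# but explicit about which columns decide whether a row is dropped.
with_coords = toy.dropna(subset=["lat", "lng"])

print(len(toy), len(with_coords))
```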
In [9]:
data['Negative_Review'].value_counts()
Out[9]:
No Negative                                                                    127035
 Nothing                                                                        14227
 Nothing                                                                         4212
 nothing                                                                         2211
 N A                                                                             1032
                                                                                ...  
 Room wasn t ready rooms freezing hotel basic and outdated                          1
 not so close to underground                                                        1
 There was a terrible smell when you switched on the light in the bathroom          1
 A bit far with underground walk more than 5 minutes                                1
 I was in 3rd floor It didn t work Free Wife                                        1
Name: Negative_Review, Length: 327927, dtype: int64
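
The counts above list ` Nothing` more than once, which suggests that some entries differ only in surrounding whitespace or letter case. A hedged sketch of normalising such strings before counting, using made-up sample values and a hypothetical helper `normalise`:

```python
from collections import Counter

def normalise(review):
    """Trim surrounding whitespace and lowercase a review string."""
    return review.strip().lower()

# Made-up sample mimicking the variants seen in the output above.
samples = [" Nothing ", " Nothing", " nothing ", "No Negative", " N A "]

raw_counts = Counter(samples)
clean_counts = Counter(normalise(s) for s in samples)

print(len(raw_counts), len(clean_counts))
print(clean_counts["nothing"])
```

On the full column, the pandas equivalent would be along the lines of `data['Negative_Review'].str.strip().str.lower().value_counts()`.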
In [10]:
data.describe()
Out[10]:
Additional_Number_of_Scoring Average_Score Review_Total_Negative_Word_Counts Total_Number_of_Reviews Review_Total_Positive_Word_Counts Total_Number_of_Reviews_Reviewer_Has_Given Reviewer_Score lat lng
count 512470.000000 512470.000000 512470.000000 512470.000000 512470.000000 512470.000000 512470.000000 512470.000000 512470.000000
mean 500.118391 8.397934 18.541864 2747.504902 17.765052 7.152272 8.395594 49.442439 2.823803
std 501.419262 0.549133 29.693695 2322.698454 21.789025 11.028943 1.638170 3.466325 4.579425
min 1.000000 5.200000 0.000000 43.000000 0.000000 1.000000 2.500000 41.328376 -0.369758
25% 169.000000 8.100000 2.000000 1161.000000 5.000000 1.000000 7.500000 48.214662 -0.143372
50% 343.000000 8.400000 9.000000 2134.000000 11.000000 3.000000 8.800000 51.499981 0.010607
75% 666.000000 8.800000 23.000000 3633.000000 22.000000 8.000000 9.600000 51.516288 4.834443
max 2682.000000 9.800000 408.000000 16670.000000 395.000000 355.000000 10.000000 52.400181 16.429233
In [11]:
data.describe(include='object').T
Out[11]:
count unique top freq
Hotel_Address 512470 1476 163 Marsh Wall Docklands Tower Hamlets London ... 4789
Review_Date 512470 731 8/2/2017 2584
Hotel_Name 512470 1475 Britannia International Hotel Canary Wharf 4789
Reviewer_Nationality 512470 227 United Kingdom 244457
Negative_Review 512470 327927 No Negative 127035
Positive_Review 512470 409941 No Positive 35737
Tags 512470 54934 [' Leisure trip ', ' Couple ', ' Double Room '... 5100
days_since_review 512470 731 1 days 2584
In [12]:
data["Hotel_Address"].head(10)
Out[12]:
0     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
1     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
2     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
3     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
4     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
5     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
6     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
7     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
8     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
9     s Gravesandestraat 55 Oost 1092 AA Amsterdam ...
Name: Hotel_Address, dtype: object
In [13]:
print("Duplicated rows before: ",data.duplicated().sum())
data.drop_duplicates(inplace=True)
print("Duplicated rows after: ",data.duplicated().sum())
Duplicated rows before:  526
Duplicated rows after:  0
In [14]:
data["Hotel_Address"]=data["Hotel_Address"].str.replace("United Kingdom","UK")
In [15]:
data.head()
Out[15]:
Hotel_Address Additional_Number_of_Scoring Review_Date Average_Score Hotel_Name Reviewer_Nationality Negative_Review Review_Total_Negative_Word_Counts Total_Number_of_Reviews Positive_Review Review_Total_Positive_Word_Counts Total_Number_of_Reviews_Reviewer_Has_Given Reviewer_Score Tags days_since_review lat lng
0 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 8/3/2017 7.7 Hotel Arena Russia I am so angry that i made this post available... 397 1403 Only the park outside of the hotel was beauti... 11 7 2.9 [' Leisure trip ', ' Couple ', ' Duplex Double... 0 days 52.360576 4.915968
1 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 8/3/2017 7.7 Hotel Arena Ireland No Negative 0 1403 No real complaints the hotel was great great ... 105 7 7.5 [' Leisure trip ', ' Couple ', ' Duplex Double... 0 days 52.360576 4.915968
2 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 7/31/2017 7.7 Hotel Arena Australia Rooms are nice but for elderly a bit difficul... 42 1403 Location was good and staff were ok It is cut... 21 9 7.1 [' Leisure trip ', ' Family with young childre... 3 days 52.360576 4.915968
3 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 7/31/2017 7.7 Hotel Arena United Kingdom My room was dirty and I was afraid to walk ba... 210 1403 Great location in nice surroundings the bar a... 26 1 3.8 [' Leisure trip ', ' Solo traveler ', ' Duplex... 3 days 52.360576 4.915968
4 s Gravesandestraat 55 Oost 1092 AA Amsterdam ... 194 7/24/2017 7.7 Hotel Arena New Zealand You When I booked with your company on line y... 140 1403 Amazing location and building Romantic setting 8 3 6.7 [' Leisure trip ', ' Couple ', ' Suite ', ' St... 10 days 52.360576 4.915968
In [16]:
data[data['Average_Score']==9.8].Hotel_Name
Out[16]:
54717    Ritz Paris
54718    Ritz Paris
54719    Ritz Paris
54720    Ritz Paris
54721    Ritz Paris
54722    Ritz Paris
54723    Ritz Paris
54724    Ritz Paris
54725    Ritz Paris
54726    Ritz Paris
54727    Ritz Paris
54728    Ritz Paris
54729    Ritz Paris
54730    Ritz Paris
54731    Ritz Paris
54732    Ritz Paris
54733    Ritz Paris
54734    Ritz Paris
54735    Ritz Paris
54736    Ritz Paris
54737    Ritz Paris
54738    Ritz Paris
54739    Ritz Paris
54740    Ritz Paris
54741    Ritz Paris
54742    Ritz Paris
54743    Ritz Paris
54744    Ritz Paris
Name: Hotel_Name, dtype: object
In [17]:
plt.figure(figsize=(15,10))
sns.heatmap(data=data.select_dtypes(include='number').corr(), annot=True)  # correlate numeric columns only
Out[17]:
<AxesSubplot:>
In [18]:
sns.countplot(data=data[data['Reviewer_Score']==10],x=data[data['Reviewer_Score']==10].Reviewer_Nationality.head(200))
plt.xticks(rotation=90)
Out[18]:
[Figure: count plot of Reviewer_Nationality for the first 200 reviews with Reviewer_Score == 10; the x-axis carries 35 nationality labels, rotated 90 degrees.]
In [19]:
sns.countplot(data=data[data['Reviewer_Score']==2.5],x=data[data['Reviewer_Score']==2.5].Reviewer_Nationality.head(200))
plt.xticks(rotation=90)
Out[19]:
[Figure: count plot of Reviewer_Nationality for the first 200 reviews with Reviewer_Score == 2.5; the x-axis carries 38 nationality labels, rotated 90 degrees.]
In [20]:
data['Review_Date']=pd.to_datetime(data['Review_Date'])
data['years']=data['Review_Date'].dt.year
data['months']=data['Review_Date'].dt.month
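
The `Review_Date` strings shown earlier look like month/day/year (`8/3/2017`, `7/31/2017`). Passing an explicit `format` to `pd.to_datetime` avoids per-value format inference and states that reading openly. A small sketch on sample values taken from the tables above:

```python
import pandas as pd

dates = pd.Series(["8/3/2017", "7/31/2017", "8/30/2015"])

# %m/%d/%Y: month first, then day, then four-digit year.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```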
In [21]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 511944 entries, 0 to 515737
Data columns (total 19 columns):
 #   Column                                      Non-Null Count   Dtype         
---  ------                                      --------------   -----         
 0   Hotel_Address                               511944 non-null  object        
 1   Additional_Number_of_Scoring                511944 non-null  int64         
 2   Review_Date                                 511944 non-null  datetime64[ns]
 3   Average_Score                               511944 non-null  float64       
 4   Hotel_Name                                  511944 non-null  object        
 5   Reviewer_Nationality                        511944 non-null  object        
 6   Negative_Review                             511944 non-null  object        
 7   Review_Total_Negative_Word_Counts           511944 non-null  int64         
 8   Total_Number_of_Reviews                     511944 non-null  int64         
 9   Positive_Review                             511944 non-null  object        
 10  Review_Total_Positive_Word_Counts           511944 non-null  int64         
 11  Total_Number_of_Reviews_Reviewer_Has_Given  511944 non-null  int64         
 12  Reviewer_Score                              511944 non-null  float64       
 13  Tags                                        511944 non-null  object        
 14  days_since_review                           511944 non-null  object        
 15  lat                                         511944 non-null  float64       
 16  lng                                         511944 non-null  float64       
 17  years                                       511944 non-null  int64         
 18  months                                      511944 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(7), object(7)
memory usage: 78.1+ MB
In [22]:
sns.pointplot(data=data,x=data['years'],y=data['Total_Number_of_Reviews'])
plt.xticks(rotation=90)
Out[22]:
(array([0, 1, 2]),
 [Text(0, 0, '2015'), Text(1, 0, '2016'), Text(2, 0, '2017')])
In [23]:
sns.lineplot(data=data,x=data['months'],y=data['Total_Number_of_Reviews'])
Out[23]:
<AxesSubplot:xlabel='months', ylabel='Total_Number_of_Reviews'>
In [24]:
sns.lineplot(data=data,x=data['months'],y=data['Review_Total_Negative_Word_Counts'])
Out[24]:
<AxesSubplot:xlabel='months', ylabel='Review_Total_Negative_Word_Counts'>
In [25]:
sns.lineplot(data=data,x=data['months'],y=data['Review_Total_Positive_Word_Counts'])
Out[25]:
<AxesSubplot:xlabel='months', ylabel='Review_Total_Positive_Word_Counts'>
In [26]:
sns.lineplot(data=data,x=data['months'],y=data['Average_Score'])
Out[26]:
<AxesSubplot:xlabel='months', ylabel='Average_Score'>
In [27]:
sns.lineplot(data=data,x=data['months'],y=data['Reviewer_Score'])
Out[27]:
<AxesSubplot:xlabel='months', ylabel='Reviewer_Score'>
In [28]:
sns.countplot(data=data,x=data['Average_Score'])
plt.xticks(rotation=90)
Out[28]:
[Figure: count plot of Average_Score; the x-axis carries 34 distinct score values from 5.2 to 9.8, rotated 90 degrees.]
In [29]:
import plotly.express as px
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
In [30]:
data.Reviewer_Nationality.nunique()
Out[30]:
227
In [31]:
# Get the top 10 reviewer nationalities with the most reviews
nationality = data["Reviewer_Nationality"].value_counts(dropna=False)[:10]

# Create a bar chart of the nationalities and review counts
fig = px.bar(x=nationality.index, y=nationality.values, color=nationality.index,
             title="Top 10 Nationalities of Reviewers")
fig.update_layout(xaxis_title="Nationality", yaxis_title="Review Count", font=dict(size=14))
fig.show()
In [32]:
data["Hotel_Name"].nunique()
Out[32]:
1475
In [33]:
# Get the top 10 hotels with the most reviews
names = data["Hotel_Name"].value_counts(dropna=False)[:10]

fig = px.bar(x=names.index, y=names.values, color=names.index,
             title="Top 10 Hotels with the Most Reviews")
fig.update_layout(xaxis_title="Hotel Name", yaxis_title="Review Count", font=dict(size=14))
fig.show()
In [34]:
fig = px.histogram(data, x="Reviewer_Score", title='Review Score Distribution', nbins=20, text_auto=True)
fig.show()
In [35]:
fig = px.histogram(data, x="Average_Score", title='Review Average Score Distribution')
fig.show()
In [36]:
data['Negative_Review'][1]
Out[36]:
'No Negative'
In [37]:
data.loc[:, 'Positive_Review'] = data.Positive_Review.apply(lambda x: x.replace('No Positive', ''))
data.loc[:, 'Negative_Review'] = data.Negative_Review.apply(lambda x: x.replace('No Negative', ''))
In [38]:
data['Negative_Review'][1]
Out[38]:
''
In [39]:
data["Total_Review"] = data["Negative_Review"] + data["Positive_Review"]
In [40]:
data["review_type"] = data["Reviewer_Score"].apply(
    lambda x: "Bad_review" if x < 7 else "Good_review")
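
The labelling rule above treats scores below 7 as bad and everything else as good. Written as a named function with the threshold as a parameter (the function name and default are illustrative assumptions), the cut-off becomes easier to vary:

```python
def label_review(score, threshold=7.0):
    """Return 'Bad_review' for scores below the threshold, else 'Good_review'."""
    return "Bad_review" if score < threshold else "Good_review"

# Scores taken from the first rows of the dataset shown earlier.
for score in (2.9, 7.5, 7.1, 3.8, 6.7):
    print(score, label_review(score))
```

It can be applied to the column in the same way as the lambda, for example `data["Reviewer_Score"].apply(label_review)`.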
In [41]:
df_reviews = data[["Total_Review", "review_type"]]
In [42]:
df_reviews
Out[42]:
Total_Review review_type
0 I am so angry that i made this post available... Bad_review
1 No real complaints the hotel was great great ... Good_review
2 Rooms are nice but for elderly a bit difficul... Good_review
3 My room was dirty and I was afraid to walk ba... Bad_review
4 You When I booked with your company on line y... Bad_review
... ... ...
515733 no trolly or staff to help you take the lugga... Good_review
515734 The hotel looks like 3 but surely not 4 Brea... Bad_review
515735 The ac was useless It was a hot week in vienn... Bad_review
515736 The rooms are enormous and really comfortable... Good_review
515737 I was in 3rd floor It didn t work Free Wife ... Good_review

511944 rows × 2 columns

In [43]:
fig = px.histogram(df_reviews, x="review_type", title='Review Type Distribution', text_auto=True)
fig.show()
In [44]:
df_reviews[df_reviews.review_type == 'Good_review'].Total_Review.value_counts()
Out[44]:
 Location                                                                      940
 Nothing Everything                                                            936
 Everything                                                                    597
 Great location                                                                252
 Everything                                                                    203
                                                                              ... 
 Outdated hotel rooms a bit shabby arogant receptionists Excellent location      1
 Staff unobtrusive but efficient Queries answered in a helpful manner            1
 Room Service food was awful but breakfast was good                              1
 all good location                                                               1
 I was in 3rd floor It didn t work Free Wife  staff was very kind                1
Name: Total_Review, Length: 411312, dtype: int64
In [45]:
df_reviews[df_reviews.review_type == 'Bad_review'].Total_Review.value_counts()
Out[45]:
 Everything Nothing                                                                                    123
 Location                                                                                              105
 Nothing                                                                                                36
 location                                                                                               26
 Staff                                                                                                  22
                                                                                                      ... 
 The hotel is not four star                                                                              1
 The staff checking us in was rude and very un polite The breakfast was cold and tasted disgusting       1
 Noisy fan so couldn t sleep kettle didn t work cold shower                                              1
 Hotel isn t good marked from the street no window not clear bed clothes dirty mirror good terry         1
 The ac was useless It was a hot week in vienna and it only gave more hot air                            1
Name: Total_Review, Length: 85115, dtype: int64
Resample Dataset

Undersample the good reviews so that both review types have the same number of rows.

In [46]:
good_reviews = df_reviews[df_reviews.review_type == "Good_review"]
bad_reviews = df_reviews[df_reviews.review_type == "Bad_review"]
In [47]:
good_df = good_reviews.sample(n=len(bad_reviews), random_state=42)

df_review_resampled = pd.concat([good_df, bad_reviews]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
df_review_resampled.shape
Out[47]:
(172350, 2)
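
The undersampling step can be sketched on a toy frame as follows. The example combines the two classes with `pd.concat`; the data and the variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "review_type": ["Good_review"] * 7 + ["Bad_review"] * 3,
})

good = toy[toy.review_type == "Good_review"]
bad = toy[toy.review_type == "Bad_review"]

# Draw as many good reviews as there are bad ones, reproducibly.
good_sampled = good.sample(n=len(bad), random_state=42)

balanced = pd.concat([good_sampled, bad]).reset_index(drop=True)
print(balanced.review_type.value_counts().to_dict())
```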
In [48]:
df_review_resampled.head()
Out[48]:
Total_Review review_type
0 Being really picky here as all was great but ... Good_review
1 We were given unbeknown to us a handicap acce... Good_review
2 Location a little restrictive Hotel facilities Good_review
3 Staff service at the bar was appalling Bed w... Good_review
4 No information in rooms about London and some... Good_review
In [49]:
df_review_resampled.rename(columns={'Total_Review':'text'}, inplace=True)
In [50]:
sns.countplot(
  x='review_type',
  data=df_review_resampled,
  order=df_review_resampled.review_type.value_counts().index
)

plt.xlabel("type")
plt.title("Review type (resampled)");

Preprocessing of Text

Function to Remove Emojis

In [51]:
def strip_emoji(text):
    return emoji.replace_emoji(text,replace="")

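Function to convert text to lowercase and remove \r and \n characters, URLs, non-ASCII characters, numbers and punctuation

In [52]:
def strip_all_entities(text):
    # Drop carriage returns, turn newlines into spaces, and lowercase
    text = text.replace('\r', '').replace('\n', ' ').lower()
    # Remove @mentions and URLs
    text = re.sub(r"(?:\@|https?\://)\S+", "", text)
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]', r'', text)
    # Collapse runs of three or more identical characters to two (assumed intent)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # Remove digits
    text = re.sub('[0-9]+', '', text)
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)
    # Normalise whitespace between words
    text = ' '.join(text.split())
    return text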
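
One detail worth checking in a cleaning function like this is how whitespace is rejoined: calling `' '.join` on a string iterates over its characters, whereas joining the result of `split()` collapses runs of whitespace between words. A short illustration:

```python
text = "great   location"

# Joining a string directly inserts a space between every character.
per_char = " ".join("abc")

# Splitting first yields words, so the join normalises the spacing.
per_word = " ".join(text.split())

print(repr(per_char))
print(repr(per_word))
```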

Function to expand contractions

In [53]:
def decontract(text):
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text
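
The patterns above rely on apostrophes, but the review samples shown earlier (for example `It didn t work`) appear to have had apostrophes replaced by spaces. Under that assumption, a rough sketch for the negation forms could look like the following; the function name is hypothetical, and the last pattern can misfire on ordinary words ending in `n` that are followed by a lone `t`.

```python
import re

def decontract_spaced(text):
    """Expand negations whose apostrophe was replaced by a space."""
    # Irregular forms first, so the generic rule does not mangle them.
    text = re.sub(r"\bcan t\b", "can not", text)
    text = re.sub(r"\bwon t\b", "will not", text)
    # Generic "<verb>n t" -> "<verb> not", e.g. "didn t" -> "did not".
    text = re.sub(r"(\w)n t\b", r"\1 not", text)
    return text

print(decontract_spaced("It didn t work"))
print(decontract_spaced("Noisy fan so couldn t sleep"))
```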

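Function to Clean Hashtags

In [54]:
def clean_hashtags(tweet):
    # Remove hashtags trailing at the end of the text (raw string so \b, \w, \s reach the regex engine)
    new_tweet = " ".join(word.strip() for word in re.split(r'#(?!(?:hashtag)\b)[\w-]+(?=(?:\s+#[\w-]+)*\s*$)', tweet))
    # Drop the '#' and '_' symbols from hashtags in the middle of the text
    new_tweet2 = " ".join(word.strip() for word in re.split(r'#|_', new_tweet))
    return new_tweet2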
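
Regular expressions that use `\b`, `\w`, or `\s` are usually written as raw strings in Python. In an ordinary string literal `'\b'` is the backspace character rather than a word boundary, which silently changes what the pattern matches. A minimal demonstration:

```python
import re

# In a normal string literal, "\b" is the single backspace character.
plain = "\bcat\b"
# In a raw string, the backslash is kept and re sees a word boundary.
raw = r"\bcat\b"

print(len(plain), len(raw))
print(re.search(plain, "a cat") is None)
print(re.search(raw, "a cat") is not None)
```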

Function to Filter Special Characters such as $, &

In [55]:
def filter_chars(a):
    sent = []
    for word in a.split(' '):
        # Blank out any word containing '$' or '&'
        if '$' in word or '&' in word:
            sent.append('')
        else:
            sent.append(word)
    return ' '.join(sent)

Function to remove multiple consecutive spaces

In [56]:
def remove_mult_spaces(text):
    return re.sub(r"\s\s+", " ", text)

Function to apply stemming to words

In [57]:
def stemmer(text):
    tokenized = nltk.word_tokenize(text)
    ps = PorterStemmer()
    return ' '.join([ps.stem(words) for words in tokenized])

Function to apply lemmatization to words

In [58]:
def lemmatize(text):
    tokenized = nltk.word_tokenize(text)
    lm = WordNetLemmatizer()
    return ' '.join([lm.lemmatize(words) for words in tokenized])

Function to Preprocess the text by applying all above functions

In [59]:
def preprocess(text):
    text = strip_emoji(text)
    text = decontract(text)
    # text = strip_all_entities(text)
    text = clean_hashtags(text)
    text = filter_chars(text)
    text = remove_mult_spaces(text)
    text = stemmer(text)
    text = lemmatize(text)
    return text
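
The same chaining can be expressed as a list of steps applied in order, which makes it easy to switch a step on or off. The sketch below uses two stand-in steps built from the standard library, not the notebook's actual functions; all names in it are illustrative.

```python
import re
from functools import reduce

def collapse_spaces(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def drop_digits(text):
    """Remove all decimal digits."""
    return re.sub(r"[0-9]+", "", text)

# Order matters: digits are removed first, then the gaps are tidied.
STEPS = [drop_digits, collapse_spaces]

def run_pipeline(text, steps=STEPS):
    """Apply each step to the output of the previous one."""
    return reduce(lambda acc, step: step(acc), steps, text)

print(run_pipeline(" Room 101   was  noisy "))
```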
In [60]:
import nltk
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Vidit\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Out[60]:
True
In [61]:
df_review_resampled.head()
Out[61]:
text review_type
0 Being really picky here as all was great but ... Good_review
1 We were given unbeknown to us a handicap acce... Good_review
2 Location a little restrictive Hotel facilities Good_review
3 Staff service at the bar was appalling Bed w... Good_review
4 No information in rooms about London and some... Good_review
In [62]:
df_review_resampled['review_type'].nunique()
Out[62]:
2
In [63]:
review_type=['Good_review','Bad_review']
In [64]:
df_review_resampled['cleaned_text']=df_review_resampled['text'].apply(preprocess)
df_review_resampled.head()
Out[64]:
text review_type cleaned_text
0 Being really picky here as all was great but ... Good_review be realli picki here a all wa great but a coup...
1 We were given unbeknown to us a handicap acce... Good_review we were given unbeknown to u a handicap access...
2 Location a little restrictive Hotel facilities Good_review locat a littl restrict hotel facil
3 Staff service at the bar was appalling Bed w... Good_review staff servic at the bar wa appal bed wa veri c...
4 No information in rooms about London and some... Good_review no inform in room about london and some staff ...

Cleaned text added¶

Dealing with Duplicates¶

In [65]:
df_review_resampled["cleaned_text"].duplicated().sum()
Out[65]:
4604
In [66]:
df_review_resampled.drop_duplicates("cleaned_text", inplace=True)

Duplicates removed¶

Tokenization¶

In [67]:
df_review_resampled['review_list'] = df_review_resampled['cleaned_text'].apply(word_tokenize)
df_review_resampled.head()
Out[67]:
text review_type cleaned_text review_list
0 Being really picky here as all was great but ... Good_review be realli picki here a all wa great but a coup... [be, realli, picki, here, a, all, wa, great, b...
1 We were given unbeknown to us a handicap acce... Good_review we were given unbeknown to u a handicap access... [we, were, given, unbeknown, to, u, a, handica...
2 Location a little restrictive Hotel facilities Good_review locat a littl restrict hotel facil [locat, a, littl, restrict, hotel, facil]
3 Staff service at the bar was appalling Bed w... Good_review staff servic at the bar wa appal bed wa veri c... [staff, servic, at, the, bar, wa, appal, bed, ...
4 No information in rooms about London and some... Good_review no inform in room about london and some staff ... [no, inform, in, room, about, london, and, som...

Removing text without words¶

In [68]:
# Number of tokens in each review
df_review_resampled['text_len'] = df_review_resampled['review_list'].apply(len)
In [69]:
df_review_resampled=df_review_resampled[df_review_resampled['text_len']!=0]
In [70]:
df_review_resampled.shape
Out[70]:
(167745, 5)
In [71]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# Classes are encoded in sorted order: 'Bad_review' -> 0, 'Good_review' -> 1
encoded_review = encoder.fit_transform(df_review_resampled.review_type.values)
In [72]:
encoded_review[1:5]
Out[72]:
array([1, 1, 1, 1])
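
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why the `Good_review` rows shown earlier are encoded as 1. A small sketch on toy labels (the three-element list is illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Good_review", "Bad_review", "Good_review"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the names in code order: index 0 -> code 0, index 1 -> code 1.
print(encoder.classes_.tolist())  # ['Bad_review', 'Good_review']
print(encoded.tolist())           # [1, 0, 1]
# inverse_transform maps integer codes back to the original names.
print(encoder.inverse_transform([0, 1]).tolist())  # ['Bad_review', 'Good_review']
```

Any list of display names passed to a plot or report should follow the order of `encoder.classes_`, otherwise the names end up attached to the wrong class.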
In [73]:
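from sklearn.model_selection import train_test_split, GridSearchCV

# Note: the split uses the raw `text` column, so the `cleaned_text` column
# built above is not what the models are trained on.
X_train, X_test, y_train, y_test = train_test_split(
    df_review_resampled.text,
    encoded_review,
    test_size=0.25,
    random_state=42
)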
In [74]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[74]:
((125808,), (41937,), (125808,), (41937,))

Feature Engineering¶

TF-IDF¶

In [75]:
tf_idf = TfidfVectorizer()
X_train_tf = tf_idf.fit_transform(X_train)
X_test_tf = tf_idf.transform(X_test)
print(X_train_tf.shape)
print(X_test_tf.shape)
(125808, 43624)
(41937, 43624)
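
Both matrices have the same number of columns because the vocabulary is learned only from the training split and then reused for the test split. A toy sketch of that behaviour (the documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["staff was friendly", "room was dirty", "friendly staff clean room"]
test_docs = ["dirty bathroom"]  # "bathroom" never appears in train_docs

vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_toy = vectorizer.transform(test_docs)        # reuses them, learns nothing new

print(X_train_toy.shape, X_test_toy.shape)  # (3, 6) (1, 6)
print(sorted(vectorizer.vocabulary_))
# Words unseen during fit are silently dropped, so only "dirty" is non-zero here.
print(X_test_toy.nnz)  # 1
```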

Trying Different ML Models¶

Logistic Regression¶

In [76]:
lr=LogisticRegression()
In [77]:
lr_cv_score=cross_val_score(lr,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
In [78]:
mean_lr_cv = np.mean(lr_cv_score)
mean_lr_cv
Out[78]:
0.8260478114536053

Support Vector Classifier¶

In [79]:
lin_svc = LinearSVC()
In [80]:
lin_svc_cv_score = cross_val_score(lin_svc,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_lin_svc_cv = np.mean(lin_svc_cv_score)
mean_lin_svc_cv
Out[80]:
0.8181824314320018

Naive Bayes Classifier¶

In [81]:
multiNB = MultinomialNB()
multiNB_cv_score = cross_val_score(multiNB,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_multiNB_cv = np.mean(multiNB_cv_score)
mean_multiNB_cv
Out[81]:
0.7969156929095422

Decision Tree Classifier¶

In [82]:
dtree = DecisionTreeClassifier()
dtree_cv_score = cross_val_score(dtree,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_dtree_cv = np.mean(dtree_cv_score)
mean_dtree_cv
Out[82]:
0.709411235999091

Random Forest Classifier¶

In [83]:
rand_forest = RandomForestClassifier()
rand_forest_cv_score = cross_val_score(rand_forest,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
In [84]:
mean_rand_forest_cv = np.mean(rand_forest_cv_score)
mean_rand_forest_cv
Out[84]:
0.7973628754332519

Adaboost Classifier¶

In [85]:
adab=AdaBoostClassifier()
In [86]:
adab_cv_score = cross_val_score(adab,X_train_tf,y_train,cv=5,scoring='f1_macro',n_jobs=-1)
mean_adab_cv = np.mean(adab_cv_score)
mean_adab_cv
Out[86]:
0.7739219811500005

Comparing the mean 5-fold cross-validated macro F1 scores, Logistic Regression (0.826) and the linear SVM (0.818) perform similarly and ahead of the other models. Of the two, we will go with the SVM, as we consider it the more general and lightweight option.
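
The cross-validation cells above repeat the same call with a different estimator each time. One compact way to express that comparison is a loop over a dictionary of models. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for `X_train_tf` and `y_train`, and only three of the models to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would pass X_train_tf and y_train instead.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svc": LinearSVC(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Mean 5-fold macro F1 per model, the same scoring used in the cells above.
scores = {
    name: float(np.mean(cross_val_score(model, X, y, cv=5, scoring="f1_macro")))
    for name, model in models.items()
}

# Print the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```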

Fine Tuning SVC¶

In [87]:
svc1 = LinearSVC()
param_grid = {
    'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
    'loss': ['hinge', 'squared_hinge'],
    'fit_intercept': [True, False],
}
grid_search = GridSearchCV(svc1, param_grid, cv=5, scoring='f1_macro', n_jobs=-1, verbose=0, return_train_score=True)
grid_search.fit(X_train_tf, y_train)
Out[87]:
GridSearchCV(cv=5, estimator=LinearSVC(), n_jobs=-1,
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
                         'fit_intercept': [True, False],
                         'loss': ['hinge', 'squared_hinge']},
             return_train_score=True, scoring='f1_macro')
In [88]:
grid_search.best_estimator_
Out[88]:
LinearSVC(C=0.1)
In [89]:
grid_search.best_score_
Out[89]:
0.8259393847278773
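
The tuned configuration (`C=0.1`) scores 0.826 in cross-validation, compared with 0.818 for the default `LinearSVC`. To carry that gain into evaluation, the tuned estimator can be taken directly from the search object. A minimal sketch on synthetic data, where the variable names and the reduced grid are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and encoded labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1]}, cv=5, scoring="f1_macro")
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr.
tuned = search.best_estimator_
test_f1 = f1_score(y_te, tuned.predict(X_te), average="macro")
print(search.best_params_, round(test_f1, 3))
```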

Model Evaluation¶

In [90]:
# Note: this fits the default LinearSVC from In [79], not grid_search.best_estimator_ (C=0.1)
lin_svc.fit(X_train_tf,y_train)
y_pred = lin_svc.predict(X_test_tf)
In [91]:
def print_confusion_matrix(cm, class_names, figsize=(10, 7), fontsize=14):
    # cm: integer confusion matrix; class_names: labels in the same order as its rows and columns
    df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)
    plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.ylabel('Truth')
    plt.xlabel('Prediction')
In [92]:
cm = confusion_matrix(y_test,y_pred)
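# encoder.classes_ is in label order (0 = Bad_review, 1 = Good_review), matching the rows of cm
print_confusion_matrix(cm, encoder.classes_)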
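
For reading the heatmap: `confusion_matrix` puts the true labels on the rows and the predictions on the columns, both in sorted label order. A toy sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Row i, column j counts samples whose true label is i and predicted label is j.
print(cm_toy.tolist())  # [[1, 1], [1, 2]]
```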
In [93]:
# target_names must follow label order (0 = Bad_review, 1 = Good_review), so use encoder.classes_
print('Classification Report:\n',classification_report(y_test, y_pred, target_names=encoder.classes_))
Classification Report:
               precision    recall  f1-score   support

  Bad_review       0.81      0.83      0.82     21213
 Good_review       0.83      0.80      0.81     20724

    accuracy                           0.82     41937
   macro avg       0.82      0.82      0.82     41937
weighted avg       0.82      0.82      0.82     41937

Saving the model¶

In [96]:
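# Fitted TF-IDF vectorizer
with open('hotel_reviews.pkl', 'wb') as f:
    pickle.dump(tf_idf, f)
# Fitted LinearSVC model (note the file name differs only by the underscore)
with open('hotelreviews.pkl', 'wb') as f:
    pickle.dump(lin_svc, f)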
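
The vectorizer and the model always have to be loaded and used together. One way to keep them in sync is to wrap both in a single scikit-learn `Pipeline` and pickle that instead. The sketch below is an alternative, not what the notebook does: it uses toy texts and an in-memory buffer in place of a file.

```python
import io
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["lovely staff great location", "dirty room rude staff",
         "great breakfast lovely room", "rude reception dirty bathroom"]
labels = [1, 0, 1, 0]  # toy labels: 1 = good review, 0 = bad review

# One object that vectorizes raw text and then classifies it.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(texts, labels)

# Round trip through pickle; a real script would use open(path, 'wb') / open(path, 'rb').
buffer = io.BytesIO()
pickle.dump(pipeline, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

print(restored.predict(["lovely great room"]))
```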

6. Final Report¶

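Sentiment analysis was done using several ML algorithms: Logistic Regression, linear SVM, Naive Bayes, Decision Tree, Random Forest and AdaBoost.

The best results came from Logistic Regression and the linear SVM, with mean 5-fold cross-validated macro F1 scores of 0.826 and 0.818 respectively. The SVM, evaluated on the held-out test set, reached an accuracy of 0.82.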